Abstract

This paper proposes a new estimation algorithm for the parameters of an HMM
so as to best account for the observed data. In this model, in addition to the
observation sequence, we have partial and noisy access to the hidden state
sequence as side information. This access can be seen as "partial labeling" of
the hidden states. Furthermore, we model possible mislabeling in the side
information in a joint framework and derive the corresponding EM updates
accordingly. In our simulations, we observe that using this side information,
we considerably improve the state recognition performance, up to 70%, with
respect to the "achievable margin" defined by the baseline algorithms.
Moreover, our algorithm is shown to be robust to the training conditions.

Key words: HMM training, ML estimator, side information, partial, noisy. 



* Corresponding author. 

Email addresses: huozkan@ku.edu.tr (Huseyin Ozkan), 

arda.akman@turktelekom.com.tr (Arda Akman), skozat@ku.edu.tr (Suleyman 

S. Kozat). 

Preprint submitted to Digital Signal Processing 28 February 2013 



1 Introduction 



In a wide variety of applications in time series analysis, ranging from speech
processing [1-7] and bioinformatics [8,9] to natural language processing [10-13],
the observation sequence is represented as a stochastic process that depends
on another stochastic process, which generates a sequence of hidden
(unobserved) states. With certain properties regarding the observations as
well as the states, this is known as a Hidden Markov Model (HMM) [1]. In this
paper, we particularly concentrate on the discrete-time finite-state HMM with
a finite alphabet, which is described by two random variables: the hidden
state $z_t$ and the observation $y_t$. The state sequence forms a stochastic,
discrete-time Markov chain and the probability of an observation $y_t$ depends
only on the state $z_t$. Hence, as shown in Fig. 1a, an HMM is completely
characterized by the state transition probabilities, $A_{ij}$, the observation
emission probabilities, $B_{ij}$, and the initial state probabilities $\pi_i$.
A detailed description of the model can be found in [1]. Estimation of these
model parameters, $A_{ij}$, $B_{ij}$ and $\pi_i$, is an important problem in
applications using HMMs. Since there is no closed form solution for the set of
parameters that maximizes the probability of the observation sequence given
the model, iterative algorithms such as the Expectation-Maximization (EM)
algorithm [15] (or equivalently the Baum-Welch method [16]) are used to obtain
a locally optimal solution [1]. In this paper, we derive a new set of
iterative EM equations that yield a locally optimal solution for the model
parameters under an observation model that differs from the ordinary one,
e.g., as in [1]. In our model, in addition to the observation sequence $y_t$,
we observe a part of the hidden state sequence as side information. More
precisely, at every time instant $t$, we observe the hidden state with
probability $r$, i.e., with probability $1 - r$ the state is hidden. This
gives partial access to the state sequence and hence leads to a new model
different from the ordinary HMM. We emphasize that the state observations are
not necessarily confined to a time interval but may even be sparsely and
randomly distributed along the complete time span of the application. In the
limiting case, if $r$ is 0, then there is no state observation, and we recover
the ordinary, unsupervised HMM training. Therefore, our model provides a
generalized framework by allowing partial access to the state sequence.
Moreover, we also allow a state observation to be corrupted with noise such
that if $z_t$ is ever observed, say as $x_t$, then $P(z_t \neq x_t) = 1 - p$,
as shown in Fig. 1b. Under these new circumstances, we explicitly provide the
mathematical derivations of the new set of iterative EM equations that
incorporate the side information and estimate the model parameters
accordingly. In these derivations, the probability that a state observation is
incorrect, $1 - p$, is assumed to be known and it is provided to our algorithm
as a parameter, $p$, which defines the confidence in the side information.
Simulations show that our method is robust to the confidence parameter $p$,
even if it does not exactly match the underlying true quality of the side
information, $p_{\text{true}}$.

Since the hidden state sequence is partially observed, our work falls into the
category of Partially Hidden Markov Model (PHMM) training (note that this term
is used in [17] in a different context). Similar to semi-supervised learning,
PHMMs use both "labeled" (in our context the state information) and
"unlabeled" data to obtain improved model training. Such an approach is
suitable when we have access to a limited amount of labeled data along with a
large amount of unlabeled data. This happens, as an example, in speech
processing applications [7], where labeling, i.e., transcription, is naturally
costly, hence only a limited amount is affordable, and transcriptions may
contain errors. Furthermore, by allowing noisy access to the states, we model
the "mislabeling" event that may occur during the labeling stage. PHMMs, to
the best of our knowledge, date back to the studies [11-13] in the area of
Natural Language Processing. In these studies, tagged text, corresponding to
the known states of a PHMM, is first analyzed through relative frequency
modeling to construct an initial model, then this model is fed into the
ordinary HMM training algorithm. However, these studies do not rigorously show
how the partial state information is incorporated within the ordinary HMM
parameter learning framework. The Maximum Likelihood Estimator (MLE) for the
model parameters in a special case of PHMMs, where only a certain state from
the state space in the underlying Markov chain is known, is theoretically
analyzed (consistency and asymptotic normality of the estimator) in [10].
However, the equations for computing the MLE (using the EM algorithm or other
likelihood maximization techniques) in this special case of PHMM are not
derived. In [13], iterative EM equations for a general case, where each
observation can only belong to a pre-defined set of acceptable states, are
given, but no complete derivation is provided. In contrast, we explicitly
derive the new set of iterative EM equations for the PHMM parameter learning
problem when there is partial access to the underlying hidden state sequence.
Furthermore, the partial observation of the state sequence might be prone to
noise in our model, and this case is not considered in the existing
literature.

[Figure 1 diagram: (a) state chain $z_1, z_2, \ldots, z_T$ with transition
probabilities $A$ and emission probabilities $B$ to the observations
$y_1, y_2, \ldots, y_T$; (b) side-information switch with
$P(\text{switch ON}) = r$ and $P(x_t \neq z_t \mid \text{switch ON}) = 1 - p$.]

Fig. 1. (a) An example HMM with discrete-time finite states, $z_t$, and
observations, $y_t$, of a finite alphabet. (b) Observation model with noisy
and partial access to the state sequence.

After we provide the brief description of the basic HMM framework and the 
parameter estimation equations in Section II, we derive the new set of iterative 
EM equations that incorporates partial and noisy access to the state sequence 
as side information in Section III. Simulations are presented in Section IV and 
the paper concludes with final remarks in Section V. 

2 Problem Description 

In this section, we briefly describe the basic HMM framework [1]. For the
sake of notational simplicity, we study the discrete-time finite-state HMM
with a finite alphabet. However, our derivations for incorporating the side
information in Section III can be readily extended to the case where the
observations come from a continuous distribution and the outcomes are vectors.
A discrete-time HMM with a finite alphabet is formally a Markov model for
which we have a sequence of observations, $y_t$, drawn from a finite alphabet
$V = \{v_1, v_2, \ldots, v_{N_v}\}$, i.e., $y_t \in V$, $1 \le t \le T$. We
also have a sequence of hidden (unobserved) states
$z_t \in S = \{s_1, s_2, \ldots, s_{N_s}\}$, where $S$ is the set of possible
states, generated from a Markov process, i.e.,
$P(z_t \mid z_{t-1}, z_{t-2}, \ldots, z_1) = P(z_t \mid z_{t-1})$. The
observation sequence, $y_t$, is generated based on the state sequence $z_t$,
i.e., $P(y_t \mid z_t, z_{t-1}, \ldots, z_1, y_{t-1}, \ldots, y_1) = P(y_t \mid z_t)$.
We consider $A$ as the transition matrix, where $A_{ij}$ represents the
transition probability from state $s_i$ to $s_j$,
$A_{ij} = P(z_t = s_j \mid z_{t-1} = s_i)$. Similarly, $B$ collects the
observation probabilities at each state, i.e.,
$B_{ij} = P(y_t = v_j \mid z_t = s_i)$. In order to complete the HMM
observation model, we also define the initial state distribution vector as
$\pi_i = P(z_1 = s_i)$. Thus, $\lambda = (A, B, \pi)$ represents the parameter
set that completely characterizes the HMM shown in Fig. 1a.
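To make the parameter set $\lambda = (A, B, \pi)$ concrete, the following minimal Python sketch draws a state sequence and an observation sequence from a discrete HMM. The function name `sample_hmm`, the 0-based integer encoding of states and symbols, and the 2-state parameter values are assumptions introduced only for illustration.

```python
import numpy as np

def sample_hmm(A, B, pi, T, rng):
    """Draw hidden states z and observations y (0-based indices) of length T."""
    z = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    z[0] = rng.choice(len(pi), p=pi)                    # z_1 ~ pi
    y[0] = rng.choice(B.shape[1], p=B[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(A.shape[1], p=A[z[t - 1]])    # A[i, j] = P(z_t = s_j | z_{t-1} = s_i)
        y[t] = rng.choice(B.shape[1], p=B[z[t]])        # B[i, j] = P(y_t = v_j | z_t = s_i)
    return z, y

# Hypothetical 2-state, 2-symbol parameters chosen only for illustration.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
z, y = sample_hmm(A, B, pi, T=10, rng=np.random.default_rng(0))
```

Each row of `A` and `B` is a probability distribution, which is why the rows are passed directly as the `p` argument of the sampler.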

When we have access to only the observation sequence, i.e., without labels,
the iterative EM equations, which provide a locally optimal solution for the
HMM parameters $\lambda$, are given in [1]. These equations are obtained
through likelihood maximization, i.e., $\arg\max_{\lambda} P(Y \mid \lambda)$,
$Y = \{y_1, y_2, \ldots, y_T\}$, which is carried out with the well known
forward-backward procedure [19,20]. To describe this procedure, we first
define the forward variable, $\alpha_t(i)$, along with the recursion in [1] as

$$\alpha_t(i) = P(y_1, y_2, \ldots, y_t, z_t = s_i \mid \lambda) = B_{i y_t} \sum_{j=1}^{N_s} \alpha_{t-1}(j) A_{ji}, \quad \alpha_1(i) = \pi_i B_{i y_1}, \quad 2 \le t \le T, \tag{1}$$

which is the probability of observing $Y_1^t = \{y_1, y_2, \ldots, y_t\}$ and
being at state $z_t = s_i$, given the model $\lambda$. Similarly, the backward
variable is given by

$$\beta_t(i) = P(y_{t+1}, y_{t+2}, \ldots, y_T \mid z_t = s_i, \lambda) = \sum_{j=1}^{N_s} \beta_{t+1}(j) A_{ij} B_{j y_{t+1}}, \quad 1 \le t \le T-1, \quad \beta_T(i) = 1, \tag{2}$$

which is the probability of observing
$Y_{t+1}^T = \{y_{t+1}, y_{t+2}, \ldots, y_T\}$, given the state $z_t = s_i$
and the model. Based on these, we define the probability of transition at
time $t$ from state $s_i$ to $s_j$, given the observations $Y$ (note that, for
ease of notation, we drop the subscript and superscript of $Y_1^T$ when
denoting the set of all observations, i.e.,
$Y = Y_1^T = \{y_1, y_2, \ldots, y_T\}$) and the model as

$$\epsilon_t(i,j) = P(z_t = s_i, z_{t+1} = s_j \mid Y, \lambda) = \frac{\alpha_t(i) A_{ij} B_{j y_{t+1}} \beta_{t+1}(j)}{\sum_{k=1}^{N_s} \sum_{l=1}^{N_s} \alpha_t(k) A_{kl} B_{l y_{t+1}} \beta_{t+1}(l)}. \tag{3}$$

Here, if we sum $\epsilon_t(i,j)$ over the time index $t$, then we obtain the
expected number of transitions from state $s_i$ to $s_j$ in the hidden state
sequence $Z = \{z_1, z_2, \ldots, z_T\}$. Next, we define the probability of
being at state $z_t = s_i$, given the observations and the model as

$$\gamma_t(i) = \sum_{j=1}^{N_s} \epsilon_t(i,j). \tag{4}$$

Similarly, the summation of $\gamma_t(i)$ over the time index $t$ yields the
expected number of times in the state $s_i$.
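The forward-backward quantities defined above can be sketched in a few lines of Python. This is an illustrative, unscaled implementation that assumes 0-based symbol indices and short sequences (for long sequences the products underflow and a scaled or log-domain variant would be needed). The name `forward_backward` and the example parameters are assumptions, and `gamma` is computed as $\alpha_t(i)\beta_t(i)/P(Y \mid \lambda)$, which coincides with the sum of $\epsilon_t(i,j)$ over $j$ for $t < T$ and is also defined at $t = T$.

```python
import numpy as np

def forward_backward(A, B, pi, y):
    """Unscaled forward-backward pass for a discrete HMM (short sequences only)."""
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))                                # beta_T(i) = 1
    alpha[0] = pi * B[:, y[0]]                            # alpha_1(i) = pi_i B_{i,y_1}
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)        # sum_j alpha_{t-1}(j) A_ji
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])      # sum_j A_ij B_{j,y_{t+1}} beta_{t+1}(j)
    lik = alpha[-1].sum()                                 # P(Y | lambda)
    # eps[t, i, j] = alpha_t(i) A_ij B_{j,y_{t+1}} beta_{t+1}(j) / P(Y | lambda)
    eps = alpha[:-1, :, None] * A[None] * (B[:, y[1:]].T * beta[1:])[:, None, :] / lik
    gamma = alpha * beta / lik
    return alpha, beta, eps, gamma

# Hypothetical parameters and observations, for illustration only.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
y = np.array([0, 0, 1, 1, 0])
alpha, beta, eps, gamma = forward_backward(A, B, pi, y)
```

A useful sanity check is that the sum over $i$ of $\alpha_t(i)\beta_t(i)$ is the same for every $t$, since it always equals $P(Y \mid \lambda)$.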

Then, given the definitions in (1)-(4) and the observation sequence $Y$, we
estimate the HMM parameters using likelihood maximization through the EM
algorithm [15], i.e., $\hat{\lambda} = \arg\max_{\lambda} P(Y \mid \lambda)$.
The iterative EM equations that solve this maximization problem (at least
locally) are as follows:

$$\hat{A}_{ij} = \frac{\sum_{t=1}^{T-1} \epsilon_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \hat{B}_{ij} = \frac{\sum_{t=1}^{T} \mathbf{1}_{\{y_t = v_j\}} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad \hat{\pi}_i = \gamma_1(i). \tag{5}$$

Here, given the training data, we estimate the HMM parameters $\lambda$ by the
iterative re-estimation procedure defined by the EM algorithm. Namely, given
the HMM parameters $\lambda_{q-1}$ at an iteration $q$, we re-estimate the
model parameters as $\lambda_q$ using the re-estimation formulas in (5). This
procedure is guaranteed not to decrease the likelihood of the observations at
any iteration and converges to a set of HMM parameters $\hat{\lambda}$, which
is at least locally optimal (cf. [1] and the references therein).
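As a small illustration of the re-estimation step, the sketch below maps given posteriors to updated parameters. The function name `reestimate` and the array layout are assumptions, and the state posterior at the final time instant is assumed to be available. In the example the posteriors are one-hot (a fully known state path, chosen only for illustration), in which case the updates reduce to relative frequency counts.

```python
import numpy as np

def reestimate(eps, gamma, y, n_symbols):
    """Re-estimate (A, B, pi) from posteriors.
    eps:   (T-1, Ns, Ns) array, eps[t, i, j] ~ P(z_t = s_i, z_{t+1} = s_j | data)
    gamma: (T, Ns) array,       gamma[t, i]  ~ P(z_t = s_i | data)
    y:     (T,) observed symbol indices (0-based)
    """
    A_new = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    onehot = np.eye(n_symbols)[y]                        # onehot[t, j] = 1 if y_t = v_j
    B_new = (gamma.T @ onehot) / gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]
    return A_new, B_new, pi_new

# Hypothetical fully labeled path: posteriors become indicators.
z = np.array([0, 0, 1, 1, 0])
y = np.array([0, 1, 1, 1, 0])
T, Ns = len(z), 2
gamma = np.eye(Ns)[z]
eps = np.zeros((T - 1, Ns, Ns))
eps[np.arange(T - 1), z[:-1], z[1:]] = 1.0
A_new, B_new, pi_new = reestimate(eps, gamma, y, n_symbols=2)
```

With soft posteriors from a forward-backward pass, the same function performs one M-step of the iterative procedure.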

In the following section, we derive the new set of iterative EM equations that
incorporates the noisy side information into the HMM framework.

3 HMM training with noisy and partial access to the state sequence

In this section, we derive the new set of iterative EM equations for HMM
parameter learning when we have noisy side information on the hidden states.
Here, we have an observation sequence $y_t \in Y = \{y_1, y_2, \ldots, y_T\}$,
with partial and noisy access to the hidden states,
$z_t \in Z = \{z_1, z_2, \ldots, z_T\}$, as this side information. Each hidden
state $z \in Z$ might be observed as $x$ with probability $r$, i.e., we do not
necessarily have a state observation at a given time instant. Hence, we have
partial access to the hidden state sequence. In addition to this partial
access, a state observation $x$ might also be noisy such that
$P(z \neq x) = 1 - p$. We assume that if an error happens, then
$P(x = s) = \frac{1 - p}{N_s - 1}$, $\forall s \in S$ s.t. $z \neq s$. For
ease of notation, we define the state observations at every time $t$ as
$x_t \in X = \{x_1, x_2, \ldots, x_T\}$, such that if $z_t$ is ever observed
as $x$, then $x_t = x$. Otherwise, $x_t = s_0$, where $s_0$ is a pseudo-state.
This expands our state space to $S' = S \cup \{s_0\}$. Thus, we model
mislabeling and partial labeling jointly in one complete framework as shown in
Fig. 1b.
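The side-information model described above can be simulated with the following Python sketch. It assumes 0-based state indices and encodes the pseudo-state $s_0$ as `-1`; the function name `sample_side_info` and this encoding are illustrative choices, not part of the original formulation.

```python
import numpy as np

def sample_side_info(z, n_states, r, p, rng):
    """Generate noisy, partial labels x for the hidden states z.
    x[t] = -1 encodes the pseudo-state s_0 (state not observed at time t).
    """
    x = np.full(len(z), -1)
    for t, zt in enumerate(z):
        if rng.random() < r:                       # switch ON with probability r
            if rng.random() < p:                   # correct label with probability p
                x[t] = zt
            else:                                  # otherwise uniform over the other states
                others = [s for s in range(n_states) if s != zt]
                x[t] = rng.choice(others)
    return x

# Hypothetical 3-state path, for illustration only.
z = np.array([0, 1, 2, 1, 0, 2, 2, 1])
x = sample_side_info(z, n_states=3, r=0.3, p=0.8, rng=np.random.default_rng(0))
```

Setting `r = 0` yields no labels at all, which corresponds to the ordinary unsupervised setting, while `r = 1, p = 1` yields a fully and correctly labeled sequence.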

In order to incorporate the side information into the new framework, we first
define the updated variables of the forward-backward procedure, which will
later be used in the derivation of the new set of iterative EM equations. The
updated forward variable,

$$\bar{\alpha}_t(i) = P(Y_1^t, X_1^t, z_t = s_i \mid \lambda), \tag{6}$$

is the probability of observing $(Y_1^t, X_1^t = \{x_1, x_2, \ldots, x_t\})$
and being at state $z_t = s_i$, given the model $\lambda$. Note that $z_t$ is
the correct underlying hidden state, whereas $X_1^t$ are the state
observations, for which we might have: (1) $x_t = s_0$, corresponding to the
case that $z_t$ is not actually observed, and (2) a noisy $x_t$, if $z_t$ is
actually observed. Similarly, the backward variable,

$$\bar{\beta}_t(i) = P(Y_{t+1}^T, X_{t+1}^T \mid z_t = s_i, \lambda), \tag{7}$$

is the probability of observing
$(Y_{t+1}^T, X_{t+1}^T = \{x_{t+1}, x_{t+2}, \ldots, x_T\})$, given the model
and the state $z_t = s_i$. The updated forward and backward variables are the
key variables of the new framework, which incorporate the side information.
The following proposition explicitly relates these variables to the side
information and provides the corresponding recursions.

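Proposition 1: For the updated forward and backward variables defined in (6)
and (7), we have

$$\bar{\alpha}_t(i) = \nu(x_t, s_i) B_{i y_t} \sum_{j=1}^{N_s} A_{ji} \bar{\alpha}_{t-1}(j), \quad 2 \le t \le T,$$

$$\bar{\beta}_t(i) = \sum_{j=1}^{N_s} \nu(x_{t+1}, s_j) \bar{\beta}_{t+1}(j) A_{ij} B_{j y_{t+1}}, \quad 1 \le t \le T-1,$$

where
$\nu(x_t, s_i) = \mathbf{1}_{\{x_t = s_0\}} (1 - r) + \mathbf{1}_{\{x_t = s_i\}} r p + \mathbf{1}_{\{x_t \neq s_i \wedge x_t \neq s_0\}} \frac{r (1 - p)}{N_s - 1}$,
$s_i \neq s_0$, $s_j \neq s_0$ and $\mathbf{1}_{\{h\}}$ is the indicator
function such that $\mathbf{1}_{\{h\}} = 1$ if $h$ is true and
$\mathbf{1}_{\{h\}} = 0$, otherwise.

Proof: Using the marginalization over the random variable $z_{t-1}$, we can
obtain $\bar{\alpha}_t(i)$ as

$$\bar{\alpha}_t(i) = \sum_{j=1}^{N_s} P(Y_1^t, X_1^t, z_t = s_i, z_{t-1} = s_j \mid \lambda),$$

which can be expressed, using the product of conditional probabilities, as

$$\bar{\alpha}_t(i) = \sum_{j=1}^{N_s} P(y_t, x_t, z_t = s_i \mid Y_1^{t-1}, X_1^{t-1}, z_{t-1} = s_j, \lambda) \, P(Y_1^{t-1}, X_1^{t-1}, z_{t-1} = s_j \mid \lambda).$$

By definition of the updated forward variable, we get

$$\bar{\alpha}_t(i) = \sum_{j=1}^{N_s} P(y_t, x_t, z_t = s_i \mid Y_1^{t-1}, X_1^{t-1}, z_{t-1} = s_j, \lambda) \, \bar{\alpha}_{t-1}(j),$$

where the Markov property is applied to reach

$$\begin{aligned}
\bar{\alpha}_t(i) &= \sum_{j=1}^{N_s} P(y_t, x_t, z_t = s_i \mid z_{t-1} = s_j, \lambda) \, \bar{\alpha}_{t-1}(j) \\
&= \sum_{j=1}^{N_s} P(y_t, x_t \mid z_t = s_i, z_{t-1} = s_j, \lambda) \, P(z_t = s_i \mid z_{t-1} = s_j, \lambda) \, \bar{\alpha}_{t-1}(j) \\
&= \sum_{j=1}^{N_s} P(y_t, x_t \mid z_t = s_i, \lambda) \, P(z_t = s_i \mid z_{t-1} = s_j, \lambda) \, \bar{\alpha}_{t-1}(j).
\end{aligned}$$

Since $x_t$ and $y_t$ are independent conditioned on $(z_t, \lambda)$, we
obtain

$$\begin{aligned}
\bar{\alpha}_t(i) &= \sum_{j=1}^{N_s} P(y_t, x_t \mid z_t = s_i, \lambda) \, A_{ji} \, \bar{\alpha}_{t-1}(j) \\
&= \sum_{j=1}^{N_s} P(x_t \mid z_t = s_i, \lambda) \, P(y_t \mid z_t = s_i, \lambda) \, A_{ji} \, \bar{\alpha}_{t-1}(j).
\end{aligned}$$

Then, by definition of the probability of error events in the side
information, we get the proposition for the updated forward variable as

$$\bar{\alpha}_t(i) = \nu(x_t, s_i) B_{i y_t} \sum_{j=1}^{N_s} A_{ji} \bar{\alpha}_{t-1}(j), \quad 2 \le t \le T.$$

As for the initialization, we set
$\bar{\alpha}_1(i) = \nu(x_1, s_i) \pi_i B_{i y_1}$. Similarly, the
corresponding recursion for the updated backward variable can be found as

$$\bar{\beta}_t(i) = \sum_{j=1}^{N_s} \nu(x_{t+1}, s_j) \bar{\beta}_{t+1}(j) A_{ij} B_{j y_{t+1}}, \quad 1 \le t \le T-1,$$

for which we have the initialization $\bar{\beta}_T(i) = 1$. $\square$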
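The recursions of Proposition 1 differ from the ordinary forward-backward pass only through the weight $\nu(x_t, s_i)$. The Python sketch below illustrates this under the assumption that states are 0-based indices and the pseudo-state $s_0$ is encoded as `-1`. The names `nu` and `forward_backward_side` and the example values are hypothetical, and the recursions are left unscaled, so the sketch is suited to short sequences only.

```python
import numpy as np

def nu(x_t, n_states, r, p):
    """Vector [nu(x_t, s_i)] over states s_i; x_t = -1 encodes the pseudo-state s_0."""
    if x_t == -1:
        return np.full(n_states, 1.0 - r)               # no state observation
    w = np.full(n_states, r * (1.0 - p) / (n_states - 1))
    w[x_t] = r * p                                      # label agrees with the state
    return w

def forward_backward_side(A, B, pi, y, x, r, p):
    """Unscaled updated forward/backward variables with side information x."""
    T, N = len(y), len(pi)
    # W[t, i] = nu(x_t, s_i) * B[i, y_t]
    W = np.array([nu(xt, N, r, p) for xt in x]) * B[:, y].T
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = W[0] * pi
    for t in range(1, T):
        alpha[t] = W[t] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (W[t + 1] * beta[t + 1])
    return alpha, beta

# Hypothetical parameters, observations and labels, for illustration only.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
y = np.array([0, 0, 1, 1, 0])
x = np.array([-1, 0, -1, 1, -1])
alpha, beta = forward_backward_side(A, B, pi, y, x, r=0.4, p=0.8)
```

With `r = 0` and no labels, every weight equals one and the recursions coincide with the ordinary forward-backward procedure.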

Here, $p$ reflects the confidence that we have in the side information and it
is a parameter of our PHMM training algorithm. Ideally, when given a set of
data, $p$ (named $p_{\text{train}}$ in Section IV) should be set according to
the underlying true noise level, $1 - p_{\text{true}}$, which is unknown. This
brings an immediate trade-off between setting the confidence too low or too
high when an accurate guess about $1 - p_{\text{true}}$ is not available. If
the confidence is too high, then our algorithm basically overfits to the noise
in the side information, which degrades the state recognition performance as
discussed in Section IV. On the other hand, if the confidence is too low, then
our algorithm does not fully exploit the side information. We discuss this
later in Section IV, when investigating the robustness of our algorithm to the
confidence parameter $p$ ($p_{\text{train}}$ in Section IV).

We next define the probability of transition at time $t$ from state $s_i$ to
$s_j$, given the observations $Y$, the side information $X$, and the model as

$$\bar{\epsilon}_t(i,j) = P(z_t = s_i, z_{t+1} = s_j \mid Y, X, \lambda), \tag{8}$$

which is essential to the estimation of the HMM parameters in our new
framework. Note that the summation of $\bar{\epsilon}_t(i,j)$ over the time
index $t$ is the expected number of transitions from state $s_i$ to $s_j$,
when we have side information in addition to the observation sequence.

The following proposition relates $\bar{\epsilon}_t(i,j)$ to the updated
forward and backward variables.

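Proposition 2: With the definitions in (6) and (7), we have

$$\bar{\epsilon}_t(i,j) = P(z_t = s_i, z_{t+1} = s_j \mid Y, X, \lambda) = \frac{\bar{\alpha}_t(i) A_{ij} \, \nu(x_{t+1}, s_j) B_{j y_{t+1}} \bar{\beta}_{t+1}(j)}{P(Y, X \mid \lambda)},$$

where
$\nu(x_{t+1}, s_j) = \mathbf{1}_{\{x_{t+1} = s_0\}} (1 - r) + \mathbf{1}_{\{x_{t+1} = s_j\}} r p + \mathbf{1}_{\{x_{t+1} \neq s_j \wedge x_{t+1} \neq s_0\}} \frac{r (1 - p)}{N_s - 1}$,
$s_j \neq s_0$ and $\mathbf{1}_{\{h\}}$ is the indicator function such that
$\mathbf{1}_{\{h\}} = 1$ if $h$ is true and $\mathbf{1}_{\{h\}} = 0$,
otherwise.

Proof: Splitting the observations as $Y = (Y_1^t, y_{t+1}, Y_{t+2}^T)$ and the
side information as $X = (X_1^t, x_{t+1}, X_{t+2}^T)$ yields

$$\begin{aligned}
\bar{\epsilon}_t(i,j) &= \frac{P(z_t = s_i, z_{t+1} = s_j, Y, X \mid \lambda)}{P(Y, X \mid \lambda)} \\
&= \frac{P(z_t = s_i, X_1^t, x_{t+1}, Y_1^t, y_{t+1} \mid z_{t+1} = s_j, X_{t+2}^T, Y_{t+2}^T, \lambda) \, P(z_{t+1} = s_j, X_{t+2}^T, Y_{t+2}^T \mid \lambda)}{P(Y, X \mid \lambda)}.
\end{aligned}$$

Since $(z_t = s_i, X_1^t, x_{t+1}, Y_1^t, y_{t+1})$ is independent of
$(X_{t+2}^T, Y_{t+2}^T)$ conditioned on $(z_{t+1}, \lambda)$, we obtain

$$\bar{\epsilon}_t(i,j) = \frac{P(z_t = s_i, X_1^t, x_{t+1}, Y_1^t, y_{t+1} \mid z_{t+1} = s_j, \lambda) \, P(z_{t+1} = s_j, X_{t+2}^T, Y_{t+2}^T \mid \lambda)}{P(Y, X \mid \lambda)},$$

which, re-arranging the conditional probabilities, yields

$$\begin{aligned}
\bar{\epsilon}_t(i,j) &= \frac{P(z_t = s_i, z_{t+1} = s_j, X_1^t, x_{t+1}, Y_1^t, y_{t+1} \mid \lambda) \, P(X_{t+2}^T, Y_{t+2}^T \mid z_{t+1} = s_j, \lambda)}{P(Y, X \mid \lambda)} \\
&= \frac{P(z_{t+1} = s_j, x_{t+1}, y_{t+1} \mid z_t = s_i, X_1^t, Y_1^t, \lambda) \, P(z_t = s_i, X_1^t, Y_1^t \mid \lambda) \, P(X_{t+2}^T, Y_{t+2}^T \mid z_{t+1} = s_j, \lambda)}{P(Y, X \mid \lambda)}.
\end{aligned}$$

Since $(z_{t+1} = s_j, x_{t+1}, y_{t+1})$ and $(X_1^t, Y_1^t)$ are independent
conditioned on $(z_t = s_i, \lambda)$, and recognizing the terms
$\bar{\alpha}_t(i)$ and $\bar{\beta}_{t+1}(j)$, we obtain

$$\begin{aligned}
\bar{\epsilon}_t(i,j) &= \frac{P(z_{t+1} = s_j, x_{t+1}, y_{t+1} \mid z_t = s_i, \lambda) \, \bar{\alpha}_t(i) \bar{\beta}_{t+1}(j)}{P(Y, X \mid \lambda)} \\
&= \frac{P(x_{t+1}, y_{t+1} \mid z_{t+1} = s_j, z_t = s_i, \lambda) \, P(z_{t+1} = s_j \mid z_t = s_i, \lambda) \, \bar{\alpha}_t(i) \bar{\beta}_{t+1}(j)}{P(Y, X \mid \lambda)},
\end{aligned}$$

wherein the Markov property is used to reach

$$\bar{\epsilon}_t(i,j) = \frac{P(x_{t+1}, y_{t+1} \mid z_{t+1} = s_j, \lambda) \, P(z_{t+1} = s_j \mid z_t = s_i, \lambda) \, \bar{\alpha}_t(i) \bar{\beta}_{t+1}(j)}{P(Y, X \mid \lambda)}.$$

Since $x_{t+1}$ is independent of $y_{t+1}$ conditioned on
$(z_{t+1}, \lambda)$, we obtain

$$\bar{\epsilon}_t(i,j) = \frac{P(y_{t+1} \mid z_{t+1} = s_j, \lambda) \, P(x_{t+1} \mid z_{t+1} = s_j, \lambda) \, P(z_{t+1} = s_j \mid z_t = s_i, \lambda) \, \bar{\alpha}_t(i) \bar{\beta}_{t+1}(j)}{P(Y, X \mid \lambda)}.$$

Then, due to the definition of the probability of the error event in the side
information, we get the proposition as

$$\bar{\epsilon}_t(i,j) = \frac{\bar{\alpha}_t(i) A_{ij} \, \nu(x_{t+1}, s_j) B_{j y_{t+1}} \bar{\beta}_{t+1}(j)}{P(Y, X \mid \lambda)}. \quad \square$$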

Finally, we define the probability of being at state $z_t = s_i$, given the
observations, the side information and the model as

$$\bar{\gamma}_t(i) = P(z_t = s_i \mid Y, X, \lambda) = \sum_{j=1}^{N_s} \bar{\epsilon}_t(i,j). \tag{9}$$

Note that the summation of $\bar{\gamma}_t(i)$ over the time index $t$ is the
expected number of times we are in the state $s_i$, when we have side
information in addition to the observation sequence.
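The posteriors in (8) and (9) follow directly from the updated forward and backward variables. The sketch below assumes the per-time weights $W_{t,i} = \nu(x_t, s_i) B_{i y_t}$ have been collected in an array `W`, with the pseudo-state $s_0$ encoded as `-1` in `x`. The function name `side_posteriors` and the example values are assumptions for illustration.

```python
import numpy as np

def side_posteriors(A, pi, W):
    """Posteriors given observations and side information.
    W[t, i] = nu(x_t, s_i) * B[i, y_t]; returns (eps_bar, gamma_bar, P(Y, X | lambda)).
    """
    T, N = W.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = W[0] * pi
    for t in range(1, T):
        alpha[t] = W[t] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (W[t + 1] * beta[t + 1])
    lik = alpha[-1].sum()                                  # P(Y, X | lambda)
    # eps[t, i, j] = alpha_t(i) A_ij nu(x_{t+1}, s_j) B_{j,y_{t+1}} beta_{t+1}(j) / lik
    eps = alpha[:-1, :, None] * A[None] * (W[1:] * beta[1:])[:, None, :] / lik
    gamma = alpha * beta / lik
    return eps, gamma, lik

# Hypothetical parameters, observations and labels (-1 means "not observed").
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
y = np.array([0, 0, 1, 1, 0])
x = np.array([-1, 0, -1, 1, -1])
r, p, Ns = 0.4, 0.8, 2
nu = np.where(x[:, None] == -1, 1 - r,
              np.where(x[:, None] == np.arange(Ns)[None], r * p, r * (1 - p) / (Ns - 1)))
W = nu * B[:, y].T
eps, gamma, lik = side_posteriors(A, pi, W)
```

In this example the label at the second time instant points to the first state, so the corresponding entry of `gamma` is pulled towards that state relative to the unlabeled case.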

Next, we derive the new set of equations that incorporates the side
information $X$. For this, the parameter set $\lambda$ will be selected such
that the log-likelihood of the training data (observations and the side
information), i.e., $F = \log P(X, Y \mid \lambda)$, is maximized by using an
auxiliary function $F'$ [20,21]. Instead of maximizing $F$, we maximize the
auxiliary function $F'$ through the EM algorithm. Let
$Q(Z) = P(Z \mid X, Y, \lambda')$ be the output of the E-step. Then, with
respect to $\lambda$, the M-step maximizes

$$F' = \sum_{Z} Q(Z) \log \frac{P(X, Y, Z \mid \lambda)}{Q(Z)}.$$

The following theorem provides the main result of our work for incorporating
the side information within the framework of the basic HMM parameter learning
problem.

Theorem: With the definitions in (8) and (9), the maximization of $F'$ through
the EM algorithm is convergent (at least locally) to

$$\hat{A}_{ij} = \frac{\sum_{t=1}^{T-1} \bar{\epsilon}_t(i,j)}{\sum_{t=1}^{T-1} \bar{\gamma}_t(i)}, \qquad \hat{B}_{ij} = \frac{\sum_{t=1}^{T} \mathbf{1}_{\{y_t = v_j\}} \bar{\gamma}_t(i)}{\sum_{t=1}^{T} \bar{\gamma}_t(i)}, \qquad \hat{\pi}_i = \bar{\gamma}_1(i).$$

Proof: We give an outline of the proof. Let $Q(Z) = P(Z \mid X, Y, \lambda')$
be the output of the E-step, then the M-step carries out the following
maximization:

$$\arg\max_{\lambda} F' = \arg\max_{\lambda} \sum_{Z} Q(Z) \log \frac{P(X, Y, Z \mid \lambda)}{Q(Z)} \tag{10}$$

$$= \arg\max_{\lambda} \sum_{Z} Q(Z) \big( \log P(X, Y, Z \mid \lambda) - \log Q(Z) \big). \tag{11}$$

Since $Q(Z)$ is the output of the E-step, it does not depend on $\lambda$.
Hence, if we split the logarithm of the ratio in (10) into a subtraction, then
we can drop the second term (subtrahend) in (11) and obtain the maximization

$$\arg\max_{\lambda} F' = \arg\max_{\lambda} \sum_{Z} Q(Z) \log P(X, Y, Z \mid \lambda),$$

which, using the product of conditional probabilities, yields

$$\arg\max_{\lambda} F' = \arg\max_{\lambda} \sum_{Z} Q(Z) \log \big( P(Y \mid X, Z, \lambda) \, P(X \mid Z, \lambda) \, P(Z \mid \lambda) \big).$$

Since $X$ is independent of $\lambda$ conditioned on $Z$ and $Y$ is
independent of $X$ conditioned on $(Z, \lambda)$, we obtain

$$\arg\max_{\lambda} F' = \arg\max_{\lambda} \sum_{Z} Q(Z) \log \big( P(Y \mid Z, \lambda) \, P(X \mid Z) \, P(Z \mid \lambda) \big),$$

where we can drop the term $P(X \mid Z)$ since it does not depend on $\lambda$
and reach

$$\arg\max_{\lambda} F' = \arg\max_{\lambda} \sum_{Z} Q(Z) \log \big( P(Y \mid Z, \lambda) \, P(Z \mid \lambda) \big). \tag{12}$$

We point out that the maximization in (12) does not involve the side
information $X$, except that $Q(Z) = P(Z \mid X, Y, \lambda')$ is related to
$X$. However, since $Q(Z)$ is calculated in the E-step before the M-step
starts in the course of our algorithm, it only brings constant factors to the
maximization in (12) and, hence, does not affect the M-step derivations.
Therefore, the rest of the derivation follows the regular M-step derivations
of the EM algorithm for ordinary HMM parameter training and we estimate the
transition probabilities as

$$\hat{A}_{ij} = \frac{\sum_{Z} Q(Z) \sum_{t=1}^{T-1} \mathbf{1}_{\{z_t = s_i \wedge z_{t+1} = s_j\}}}{\sum_{Z} \sum_{t=1}^{T-1} \mathbf{1}_{\{z_t = s_i\}} P(Z \mid X, Y, \lambda')},$$

where the indicator function in the numerator and the denominator marginalizes
the probability $P(Z \mid X, Y, \lambda')$, since the outer summation is over
all possible hidden state sequences. Hence, we obtain

$$\hat{A}_{ij} = \frac{\sum_{t=1}^{T-1} P(z_t = s_i, z_{t+1} = s_j \mid X, Y, \lambda')}{\sum_{t=1}^{T-1} P(z_t = s_i \mid X, Y, \lambda')},$$

which is, given the side information, the expected number of transitions from
state $s_i$ to $s_j$ divided by the expected number of times in the state
$s_i$. Similarly, $\hat{B}_{ij}$ is given by

$$\hat{B}_{ij} = \frac{\sum_{t=1}^{T} \mathbf{1}_{\{y_t = v_j\}} P(z_t = s_i \mid X, Y, \lambda')}{\sum_{t=1}^{T} P(z_t = s_i \mid X, Y, \lambda')},$$

which is, given the side information, the expected number of times in the
state $s_i$ and observing $v_j$, divided by the expected number of times in
the state $s_i$. Also, the set of initial probabilities for the hidden state
$z_1$ is estimated as

$$\hat{\pi}_i = P(z_1 = s_i \mid X, Y, \lambda').$$

Based on the new set of equations as well as the recursions defined in
Proposition 1, we incorporated possibly corrupted side information into the
HMM training framework. In the next section, we provide examples that
demonstrate the performance of the new set of training updates under different
scenarios.
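Putting the pieces together, one EM iteration with side information can be sketched as follows. This is an illustrative, unscaled implementation for short sequences, not a reproduction of the authors' code. The name `em_step`, the `-1` encoding of the pseudo-state $s_0$, the initial parameters and the toy data are all assumptions.

```python
import numpy as np

def em_step(A, B, pi, y, x, r, p):
    """One EM iteration with side information x (x[t] = -1 encodes s_0).
    Returns updated (A, B, pi) and log P(Y, X | lambda) under the input parameters.
    """
    T, Ns, Nv = len(y), len(pi), B.shape[1]
    states = np.arange(Ns)[None]
    nu = np.where(x[:, None] == -1, 1 - r,
                  np.where(x[:, None] == states, r * p, r * (1 - p) / (Ns - 1)))
    W = nu * B[:, y].T                                     # nu(x_t, s_i) * B[i, y_t]
    # E-step: updated forward-backward recursions.
    alpha = np.zeros((T, Ns))
    beta = np.ones((T, Ns))
    alpha[0] = W[0] * pi
    for t in range(1, T):
        alpha[t] = W[t] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (W[t + 1] * beta[t + 1])
    lik = alpha[-1].sum()
    eps = alpha[:-1, :, None] * A[None] * (W[1:] * beta[1:])[:, None, :] / lik
    gamma = alpha * beta / lik
    # M-step: expected counts normalized by expected state occupancies.
    A_new = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = (gamma.T @ np.eye(Nv)[y]) / gamma.sum(axis=0)[:, None]
    return A_new, B_new, gamma[0], np.log(lik)

# Hypothetical toy data and initial guess, for illustration only.
y = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 1])
x = np.array([-1, 0, -1, 1, -1, -1, 1, -1, 0, -1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.6, 0.4], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
logliks = []
for _ in range(5):
    A, B, pi, ll = em_step(A, B, pi, y, x, r=0.4, p=0.8)
    logliks.append(ll)
```

Because `r` and `p` are held fixed, the side information enters only through the E-step weights, and the recorded joint log-likelihood should not decrease from one iteration to the next.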



4 Simulations 

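In this section, we demonstrate the performance of our method through
simulations using data generated with the following HMM parameters:

$$N_s = 3, \quad N_v = 3, \quad \pi = \begin{bmatrix} 0.3 & 0.3 & 0.4 \end{bmatrix}, \quad A = \begin{bmatrix} 0.8 & 0.19 & 0.01 \\ 0.01 & 0.8 & 0.19 \\ 0.19 & 0.01 & 0.8 \end{bmatrix}, \quad B = \begin{bmatrix} 0.6 & 0.3 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.3 & 0.1 & 0.6 \end{bmatrix}.$$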



For these simulations, we have a test set of 500 data points and a training
set of 250 data points along with the side information of a relatively high
noise level, $1 - p_{\text{true}} = 0.4$, and a relatively low noise level,
$1 - p_{\text{true}} = 0.2$, with $r$ ranging from 0 to 0.6. We emphasize that
the exact noise level may not be known by the algorithm. Hence, we provide
$p_{\text{train}}$ to the algorithm, which may not be equal to
$p_{\text{true}}$. Here, the parameter $p_{\text{train}}$ reflects the
confidence (equivalently the expected noise level) that we have in the side
information. Since this confidence might not be accurate, i.e.,
$p_{\text{train}}$ does not necessarily match $p_{\text{true}}$, for analyzing
the sensitivity of our method to the confidence parameter, we train our
algorithm with different choices for $p_{\text{train}}$: (1) we set a
confidence that is in the proximity of $p_{\text{true}}$
($p_{\text{train}} \approx p_{\text{true}}$), i.e.,
$p_{\text{train}} \in \{0.55, 0.6, 0.65\}$ if $p_{\text{true}} = 0.6$ and
$p_{\text{train}} \in \{0.75, 0.8, 0.85\}$ if $p_{\text{true}} = 0.8$, (2) we
set too high confidence in the side information
($p_{\text{train}} > p_{\text{true}}$), i.e., $p_{\text{train}} = 1$ when
$p_{\text{true}} \in \{0.6, 0.8\}$, and (3) we set too low confidence
($p_{\text{train}} < p_{\text{true}}$), i.e., $p_{\text{train}} = 0.5$ when
$p_{\text{true}} = 1$. Using the training set, we first estimate the unknown
model parameters, $A_{ij}$, $B_{ij}$, and $\pi_i$. Then, on the test set, the
hidden state sequence is estimated by the Viterbi algorithm [22,23] using the
estimated model parameters. This process is repeated 500 times and we present
the average state recognition error rates for all the aforementioned cases. In
order to show the efficacy of incorporating the side information by our
method, we compare the state recognition error rates of our algorithm with:
(1) Baseline Performance, the state recognition error rate if the model
parameters are estimated by the ordinary HMM parameter estimation. This is the
performance that is readily achievable with no side information. (2) The
Oracle, the state recognition error rate if the true model parameters are
directly used in the state estimation on the test set. This is the performance
limit if the HMM training algorithm is run on an infinite amount of training
data, which is only asymptotically achievable. (3) Limit of Algorithm, the
state recognition error rate if the side information is completely accurate
and the algorithm is trained with complete confidence in the side information,
i.e., $p_{\text{true}} = 1$, $p_{\text{train}} = 1$. This is the performance
limit that our algorithm can attain at most by exploiting the side
information. Here, we name the difference between the Baseline Performance and
the Oracle as the "achievable margin", since no algorithm can obtain state
recognition improvements larger than the achievable margin, provided that, as
in this work, the model parameters are first estimated and then used in the
Viterbi algorithm for state recognition.
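The state recognition step of this evaluation can be sketched as below: a log-domain Viterbi decoder followed by the fraction of mismatched states. The function name `viterbi`, the 2-state parameters and the ground-truth path are hypothetical and serve only to illustrate how an error rate would be computed.

```python
import numpy as np

def viterbi(A, B, pi, y):
    """Most likely state path for observations y under a discrete HMM (log domain)."""
    T, N = len(y), len(pi)
    with np.errstate(divide="ignore"):                 # allow log(0) = -inf
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = logpi + logB[:, y[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA                 # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, y[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):                      # backtrack through stored predecessors
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical estimated parameters, test observations and true states.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
y = np.array([0, 0, 1, 1, 0])
z_true = np.array([0, 0, 1, 1, 0])
z_hat = viterbi(A, B, pi, y)
error_rate = np.mean(z_hat != z_true)                  # state recognition error rate
```

In the setting described above, the same decoder would be run with the baseline, oracle and side-information-trained parameters, and the resulting error rates averaged over repeated trials.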

Fig. 2. Simulation results for different scenarios. Our algorithm is trained
with $p_{\text{train}} \in \{0.55, 0.60, 0.65\}$ when $p_{\text{true}} = 0.60$
and $p_{\text{train}} \in \{0.75, 0.80, 0.85\}$ when $p_{\text{true}} = 0.80$.
The State Recognition Error Rates are estimated by the Viterbi algorithm.
Performance of our algorithm is compared against three performance limits:
(1) Baseline Performance, error rate by ordinary HMM using no side
information, (2) Oracle, error rate if the true model parameters are used in
state recognition, and (3) Limit of Algorithm,
$p_{\text{train}} = p_{\text{true}} = 1$. See the text for details.

Our simulations show that the performance of our method, provided that
$p_{\text{train}} \approx p_{\text{true}}$, improves with the amount of side
information that is indicated by $r$. In particular, when we have accurate
access to the hidden states, i.e., $p_{\text{true}} = p_{\text{train}} = 1$,
the state recognition rate in the test set, labeled as Limit of Algorithm in
Fig. 2, consistently approaches the Oracle as $r$ increases and reaches ~90%
gain (the performance improvement over the baseline corresponds to ~90% of the
achievable margin) with 30% additional information on states, i.e.,
$r = 0.3$, as shown in Fig. 2. This demonstrates the efficacy of our method in
incorporating the side information. On the other hand, in the case of noisy
access to the hidden states such that 20% of the state observations are
mislabeled, i.e., $p_{\text{true}} = 0.8$, our method (when
$p_{\text{train}} \approx p_{\text{true}}$) is able to provide a substantial
gain, 70%, at $r = 0.3$. In this case, as $r$ increases, the recognition
performance approaches the Limit of Algorithm, showing that, asymptotically,
our algorithm optimally incorporates the side information under noise. Even if
the noise level is further increased up to a level as high as 40% mislabeling,
we still obtain a gain that consistently increases with $r$, when
$p_{\text{train}} \approx p_{\text{true}}$. Thus, our method is robust to
noise. Nevertheless, the algorithm must not rely on the side information with
too high confidence. Specifically, when we have the confidence
$p_{\text{train}} = 1$ in the case of a high noise level, i.e.,
$p_{\text{true}} = 0.6$, we do not obtain any improvement compared to the
baseline. Conversely, the algorithm does not fully exploit the side
information if the confidence is too low. For instance, in the case of
$p_{\text{train}} = 0.5$ and $p_{\text{true}} = 1$, the rate of performance
improvement with $r$ is significantly slower than that of the Limit of
Algorithm, i.e., $p_{\text{true}} = 1$, $p_{\text{train}} = 1$. According to
our simulations, setting the confidence in the proximity of the true noise
level is sufficient to obtain the maximum gain, i.e., our algorithm does not
require an exact match between $p_{\text{true}}$ and $p_{\text{train}}$. This
demonstrates that our algorithm is also robust to mismatches in the confidence
parameter $p_{\text{train}}$.
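The "gain" figures quoted in this section are expressed as a fraction of the achievable margin, i.e., the gap between the Baseline Performance and the Oracle. A minimal sketch of that computation is given below. The function name `gain` and the three error rates are hypothetical and are not read off the paper's figures.

```python
def gain(err_baseline, err_oracle, err_algorithm):
    """Fraction of the achievable margin (baseline minus oracle) recovered by the algorithm."""
    margin = err_baseline - err_oracle
    if margin <= 0:
        raise ValueError("achievable margin must be positive")
    return (err_baseline - err_algorithm) / margin

# Hypothetical error rates, chosen only to illustrate the arithmetic.
g = gain(err_baseline=0.40, err_oracle=0.30, err_algorithm=0.33)
```

With these illustrative numbers the algorithm closes 0.07 of a 0.10 margin, i.e., a gain of about 70%.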



5 Conclusion 

In this paper, we introduced a new parameter estimation algorithm for HMMs
when we have partial and noisy access to the hidden state sequence as side
information. This side information can be seen as a partial, "possibly wrong"
labeling of the hidden states. In this work, we model mislabeling and partial
labeling of the hidden states jointly in one complete framework. This
framework naturally recovers the unsupervised HMM training if the partial
access to the hidden states is turned off. In our simulations, we observed
that, using this side information, we considerably improved the state
recognition performance, up to 70%, with respect to the "achievable margin".
Moreover, our method is shown to be robust to the training conditions.
Finally, since this framework includes possible mislabeling events, our
algorithm models realistic training conditions more accurately than the
ordinary HMM training. Hence, we expect similar performance improvements in
other examples.